Abstract

We propose an online visual tracking algorithm that learns a discriminative
saliency map using a Convolutional Neural Network (CNN). Given a CNN
pre-trained offline on a large-scale image repository, our algorithm takes
outputs from hidden layers of the network as feature descriptors, since they
show excellent representation performance in various general visual
recognition problems. The features are used to learn discriminative target
appearance models using an online Support Vector Machine (SVM). In addition,
we construct a target-specific saliency map by back-propagating CNN features
with guidance of the SVM, and obtain the final tracking result in each frame
based on the appearance model generatively constructed with the saliency map.
Since the saliency map visualizes the spatial configuration of the target
effectively, it improves target localization accuracy and enables us to
achieve pixel-level target segmentation. We verify the effectiveness of our
tracking algorithm through extensive experiments on a challenging benchmark,
where our method shows outstanding performance compared to state-of-the-art
tracking algorithms.


1. Introduction

Object tracking has played important roles in a wide range of computer vision
applications. Although it has been studied extensively during past decades,
object tracking is still a difficult problem due to many challenges in
real-world videos such as occlusion, pose variations, illumination changes,
fast motion, and background clutter. Success in object tracking relies
heavily on how robust the representation of target appearance is against such
challenges.


For this reason, the problem of reliable target appearance modeling has been
actively investigated in recent tracking algorithms (Bao et al., 2012; Jia et
al., 2012; Mei & Ling, 2009; Zhang et al., 2012; Zhong et al., 2012; Ross et
al., 2004; Han et al., 2008; Babenko et al., 2011; Hare et al., 2011; Grabner
et al., 2006; Saffari et al., 2010), which are classified into two major
categories depending on learning strategies: generative and discriminative
methods. In the generative framework, the target appearance is typically
described by a statistical model estimated from tracking results in previous
frames. To maintain the target appearance model, various approaches have been
proposed, including sparse representation (Bao et al., 2012; Jia et al., 2012;
Mei & Ling, 2009; Zhang et al., 2012; Zhong et al., 2012), online density
estimation (Han et al., 2008), incremental subspace learning (Ross et al.,
2004), etc. On the other hand, the discriminative framework (Babenko et al.,
2011; Hare et al., 2011; Grabner et al., 2006; Saffari et al., 2010) aims to
learn a classifier that discriminates the target from the surrounding
background. Various learning algorithms have been incorporated, including
online boosting (Grabner et al., 2006; Saffari et al., 2010), multiple
instance learning (Babenko et al., 2011), structured support vector machine
(Hare et al., 2011), and online random forest (Gall et al., 2011; Schulter et
al., 2011). These approaches are limited to using too simple and/or
hand-crafted features for target representation, such as templates, Haar-like
features, histogram features and so on, which may not be effective in
handling the latent challenges in video sequences.

Convolutional Neural Networks (CNNs) have recently drawn a lot of attention
in the computer vision community due to their representation power.
Krizhevsky et al. (2012) trained a network using 1.2 million images for image
classification and demonstrated significantly improved performance in the
ImageNet challenge (Berg et al., 2012). Since the huge success of this work,
CNNs have been applied to representing images or objects in various computer
vision tasks, including [...] ([...] et al., 2014; He et al., 2014), object
recognition (Oquab et al., 2014; Donahue et al., 2014; Zhang et al., 2014),
pose estimation (Toshev & Szegedy, 2014), image segmentation (Hariharan et
al., 2014), image stylization (Karayev et al., 2014), etc.

Figure 1. Overall procedure of the proposed algorithm. Our tracker exploits a pre-trained CNN for both image representation and target
localization. Given a set of samples on the input frame, we first extract their features using a pre-trained CNN (Section 3.1), and classify
them by the online SVM trained until the previous time step. For each positive sample, we back-propagate the features relevant to the
target, which are identified by observing the model parameters of the SVM, through the network to obtain a saliency map of the sample
that highlights the regions discriminating the target from the background. The saliency maps of the positive examples are aggregated to build
the target-specific saliency map (Section 3.2). Finally, tracking is performed by sequential Bayesian filtering using the target-specific
saliency map as observation. To this end, a generative model is learned from target appearances in the previous saliency maps, and a
dense likelihood map is calculated by convolution between the appearance model and the target-specific saliency map (Section 3.3).
Based on the tracking result of the current frame, the SVM and generative model are updated for subsequent tracking (Section 3.4).

Despite such popularity, there have been only a few attempts to employ CNNs
for visual tracking, since offline classifiers are conceptually not
appropriate for visual tracking and online learning based on a CNN is not
straightforward due to the large network size and the lack of training data.
In addition, feature extraction from the deep structure may not be
appropriate for visual tracking because the visual features extracted from
top layers encode semantic information and exhibit relatively poor
localization performance in general. Fan et al. (2010) present a human
tracking algorithm based on a network trained offline, but it needs to learn
a separate class-specific network to track other kinds of objects. On the
other hand, Li et al. (2014) propose a target-specific CNN for object
tracking, where the CNN is trained incrementally during tracking with new
examples obtained online. The network used in this work is shallow since
learning a deep network using a limited number of training examples is
challenging, and the algorithm fails to take advantage of the rich
information extracted from deep CNNs. There is a tracking algorithm based on
a pre-trained network (Wang & Yeung, 2013), where a stacked denoising
autoencoder is trained using a large number of images to learn generic image
features. Since this network is trained with tiny gray images and has no
shared weights, its representation power is limited compared to recently
proposed CNNs.


We propose a novel tracking algorithm based on a pre-trained CNN to represent
the target, where the network is originally trained for large-scale image
classification. On top of the hidden layers in the CNN, we put an additional
layer of an online Support Vector Machine (SVM) to learn the target
appearance discriminatively against the background. The model learned by the
SVM is used to compute a target-specific saliency map by back-propagating the
information relevant to the target to the input layer (Simonyan et al., 2014).
We exploit the target-specific saliency map to obtain generative target
appearance models (filters) and perform tracking with an understanding of the
spatial configuration of the target. The overview of our algorithm is
illustrated in Figure 1, and the contributions of this paper are summarized
below:


• Although recent tracking methods based on CNNs typically attempt to learn a
network in an online manner (Li et al., 2014), our algorithm employs a
pre-trained CNN to represent generic objects for tracking and achieves
outstanding performance empirically.


• We propose a technique to construct a target-specific saliency map by
back-propagating only relevant features through the CNN, which overcomes the
limitation of the existing method, which visualizes saliency corresponding to
the predefined classes only. This technique also enables us to obtain
pixel-level target segmentation.

• We learn a simple target-specific appearance filter online and apply it to
the saliency map; this strategy improves target localization performance even
with the shift-invariant property of CNN-based features.

The rest of this paper is organized as follows. We first describe the overall
framework of our algorithm in Section 2, and the detailed methodology is
discussed in Section 3. The performance of our algorithm is presented in
Section 4.


2. Overview of Our Algorithm 


Our tracking algorithm employs a pre-trained CNN to represent the target. In
each frame, it first draws samples for candidate bounding boxes near the
target location in the previous frame, takes their image observations, and
extracts feature descriptors for the samples using the pre-trained CNN. We
found that the features from the CNN capture semantic information of the
target effectively and handle various geometric and photometric
transformations successfully, as reported in (Oquab et al., 2014; Karayev et
al., 2014; Donahue et al., 2014). However, they may lose some spatial
information of the target due to pooling operations in the CNN, which is not
desirable for tracking since the spatial configuration is a useful cue for
accurate target localization.

To fully exploit the representation power of CNN features while preserving
spatial information of the target, we adopt the target-specific saliency map
as our observation for tracking, which is generated by back-propagating
target-specific information of CNN features to the input layer. This
technique is inspired by (Simonyan et al., 2014), where a class-specific
saliency map is constructed by back-propagating the information corresponding
to the identified label to visualize the region of interest. Since the target
in a visual tracking problem belongs to an arbitrary class and its label is
unknown in advance, a model for the target class is hard to pre-train.

Hence, we employ an online SVM, which discriminates the target from the
background by learning target-specific information in the CNN features; the
target-specific information learned by the online SVM can be regarded as
label information in the context of (Simonyan et al., 2014). The SVM
classifies each sample, and we compute the saliency map for each positive
example by back-propagating its CNN feature along the pre-trained CNN with
guidance of the SVM down to the input layer. Each saliency map highlights
regions discriminating the target from the background. The saliency maps of
the positive examples are aggregated to build the target-specific saliency
map. The target-specific saliency map alleviates the limitation of CNN
features for tracking by providing the important spatial configuration of the
target.

Our tracking algorithm is then formulated as a sequential Bayesian filtering
framework using the target-specific saliency map as the observation. A
generative appearance model is constructed by accumulating target
observations in target-specific saliency maps over time, which reveals
meaningful spatial configuration of the target such as shape and parts. A
dense likelihood map of each frame is computed efficiently by convolution
between the target-specific saliency map and the generative appearance model.
The overall algorithm is illustrated in Figure 1.

Our algorithm exploits the discriminative properties of the online SVM, which
helps generate the target-specific saliency map. In addition, we construct
the generative appearance model from the saliency map and perform tracking
through sequential Bayesian filtering. This is a natural combination of
discriminative and generative approaches, and we take the benefits from both
frameworks.

3. Proposed Algorithm

This section describes the comprehensive procedure of our tracking algorithm.
We first discuss the features obtained from the pre-trained CNN. The method
to construct the target-specific saliency map is then presented in detail,
followed by a description of how the saliency map is employed for
constructing generative models and tracking the object. After that, we
present the online SVM technique employed to learn the target appearance
sequentially in a discriminative manner.


3.1. Pre-Trained CNN for Feature Descriptor 


To represent target appearances, our tracking algorithm employs a CNN that is
pre-trained on a large number of images. The pre-trained generic model is
especially useful for online tracking since it is not straightforward to
collect a sufficient amount of training data. In this paper, R-CNN (Girshick
et al., 2014) is adopted as the pre-trained model, but other CNN models can be
used alternatively. Out of the entire network structure, we take outputs from
the first fully-connected layer, as they tend to capture general
characteristics of objects and have shown excellent generalization
performance in many other domains, as described in (Donahue et al., 2014).


For a target proposal $\mathbf{x}_i$, the CNN takes its corresponding image
observation $\mathbf{z}_i$ as its input, and returns the output of the first
fully-connected layer, $\phi(\mathbf{x}_i)$, as a feature vector of
$\mathbf{x}_i$. We apply the SVM to each CNN feature vector
$\phi(\mathbf{x}_i)$ and classify $\mathbf{x}_i$ as either positive or
negative.


3.2. Target-Specific Saliency Map Estimation 

Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network

For target tracking, we first compute SVM scores of candidate samples
represented by the CNN features and classify them into target or background.
Based on this information, one naive option to complete tracking is to simply
select the optimal sample with the maximum score as

$$\mathbf{x}^* = \arg\max_i \mathbf{w}^\top \phi(\mathbf{x}_i).$$

However, this approach typically suffers from inaccurate target localization
since, when calculating $\phi(\mathbf{x}_i)$, the spatial configuration of
the target may be lost by spatial pooling operations (Fan et al., 2010).
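The naive selection rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-dimensional feature vectors, the weights `w`, and the bias `b` are made-up numbers standing in for real CNN features and a trained SVM, and the helper names `svm_score` and `select_best_sample` are hypothetical.

```python
from typing import List, Sequence, Tuple


def svm_score(feature: Sequence[float], w: Sequence[float], b: float) -> float:
    """Linear SVM score w^T phi(x) + b for one feature vector."""
    return sum(wk * fk for wk, fk in zip(w, feature)) + b


def select_best_sample(features: List[Sequence[float]],
                       w: Sequence[float], b: float) -> Tuple[int, List[int]]:
    """Return the index of the top-scoring sample and the indices of all
    samples whose score is positive (classified as target)."""
    scores = [svm_score(f, w, b) for f in features]
    best = max(range(len(scores)), key=scores.__getitem__)
    positives = [i for i, s in enumerate(scores) if s > 0]
    return best, positives


# Toy stand-ins for CNN features of four candidate boxes (made-up numbers).
features = [[0.0, 1.0, 0.2], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5], [0.0, 0.0, 1.0]]
w, b = [1.0, -1.0, 0.5], -0.2
best, positives = select_best_sample(features, w, b)
```

Adding the bias does not change the maximizer, so the returned index agrees with the argmax in the equation above; the list of positive samples is what the saliency-based method uses afterwards.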


To handle the localization issue while enjoying the effectiveness of CNN
features, we propose the target-specific saliency map, which highlights
discriminative target regions within the image. This is motivated by the
class-specific saliency map discussed in (Simonyan et al., 2014). The
class-specific saliency map of a given image $I$ is the gradient of the class
score $S_c(I)$ with respect to the image:

$$g_c(I) = \frac{\partial S_c(I)}{\partial I}. \tag{1}$$


The saliency map is constructed by back-propagation. Specifically, let
$f^{(1)}, \dots, f^{(L)}$ and $F^{(1)}, \dots, F^{(L)}$ denote the
transformation functions and their outputs in the network, where
$F^{(l)} = f^{(l)} \circ f^{(l-1)} \circ \cdots \circ f^{(1)}(I)$ and
$S_c(I) = F^{(L)}$. Eq. (1) is computed using the chain rule as

$$\frac{\partial S_c(I)}{\partial I} = \frac{\partial F^{(L)}}{\partial F^{(L-1)}} \frac{\partial F^{(L-1)}}{\partial F^{(L-2)}} \cdots \frac{\partial F^{(1)}}{\partial I}. \tag{2}$$

Intuitively, the pixels that are closely related to the class $c$ affect
changes in $S_c$ more, which means that nearby regions of such pixels would
have high values in the saliency map.

When calculating such a saliency map for object tracking, we impose
target-specific information instead of class membership due to the reasons
discussed in Section 2. For this purpose, we adopt the SVM weight vector
$\mathbf{w} = (w_1, \dots, w_n)^\top$, which is learned online to
discriminate between target and background. Since the last fully-connected
layer corresponds to the online SVM, the outputs of the last two layers in
our network are given by

$$F^{(L)} = \mathbf{w}^\top F^{(L-1)} + b, \tag{3}$$

$$F^{(L-1)} = \phi(\mathbf{x}_i). \tag{4}$$

Plugging Eq. (3) and (4) into Eq. (2), the gradient map of the target
proposal $\mathbf{x}_i$ is given by

$$g(\mathbf{x}_i) = \frac{\partial F^{(L)}}{\partial F^{(L-1)}} \frac{\partial F^{(L-1)}}{\partial \mathbf{z}_i} = \mathbf{w}^\top \left( \frac{\partial \phi(\mathbf{x}_i)}{\partial \mathbf{z}_i} \right), \tag{5}$$

where $\mathbf{z}_i$ is the image observation of $\mathbf{x}_i$.

Instead of using all entries in $\phi(\mathbf{x}_i)$ to generate the
target-specific saliency map, we only select the dimensions corresponding to
positive weights in $\mathbf{w}$ since they contribute more clearly to making
$\mathbf{x}_i$ positive. Note that every element in $\phi(\mathbf{x}_i)$ is
non-negative due to the ReLU operations in the CNN. We then obtain the
target-specific feature $\phi^+(\mathbf{x}_i)$ as

$$\phi^+_k(\mathbf{x}_i) = \begin{cases} w_k \phi_k(\mathbf{x}_i), & \text{if } w_k > 0 \\ 0, & \text{otherwise}, \end{cases}$$

where $\phi_k(\mathbf{x}_i)$ denotes the $k$-th entry of
$\phi(\mathbf{x}_i)$. Then the gradient of the target-specific feature
$\phi^+(\mathbf{x}_i)$ with respect to the image observation is obtained by

$$g(\mathbf{x}_i) = \frac{\partial \phi^+(\mathbf{x}_i)}{\partial \mathbf{z}_i}. \tag{6}$$

Since the gradient is computed only for the target-specific information
$\phi^+(\mathbf{x}_i)$, pixels that distinguish the target from the
background would have high values in $g(\mathbf{x}_i)$.

Figure 2. An example of target-specific saliency map. The face
of a person in the left image is being tracked. The target-specific
saliency map reveals meaningful spatial configuration of the target,
such as eyes, a nose and lips.


The target-specific saliency map $M$ is obtained by aggregating
$g(\mathbf{x}_i)$ of samples with positive SVM scores in image space. As
$g(\mathbf{x}_i)$ is defined over the sample observation $\mathbf{z}_i$, we
first project it to image space and zero-pad outside of $\mathbf{z}_i$; we
denote the result by $G_i$ afterwards. Then, the target-specific saliency map
is obtained by taking the pixelwise maximum magnitude of the gradient maps
$G_i$ corresponding to positive examples, which is given by

$$M(p) = \max_i |G_i(p)|, \quad \forall i \in \{ j \mid \mathbf{w}^\top \phi(\mathbf{x}_j) + b > 0 \}, \tag{7}$$

where $p$ denotes pixel location. We suppress erroneous activations from the
background by considering only positive examples when aggregating sample
gradient maps. An example of a target-specific saliency map is illustrated in
Figure 2, where strong activations typically come from target areas and the
spatial layout of the target is exposed clearly.
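To make Eq. (5) to (7) concrete, the following is a minimal sketch that replaces the deep CNN with a toy one-layer feature extractor $\phi(\mathbf{z}) = \mathrm{ReLU}(A\mathbf{z})$ on a one-dimensional "image", so that the gradient can be written by hand. The matrix `A`, the SVM parameters, the pixel values and the function names are all assumptions chosen for illustration; a real implementation would back-propagate through the pre-trained network instead.

```python
from typing import List, Sequence, Tuple


def target_specific_gradient(z: Sequence[float], A: List[List[float]],
                             w: Sequence[float]) -> List[float]:
    """Gradient of sum_k phi_k^+(z) w.r.t. z for phi(z) = ReLU(A z).

    Only feature dimensions with positive SVM weight contribute, which
    mirrors the definition of the target-specific feature.
    """
    grad = [0.0] * len(z)
    for row, wk in zip(A, w):
        if wk <= 0:
            continue  # phi_k^+ is identically zero for non-positive weights
        pre_activation = sum(a * zj for a, zj in zip(row, z))
        if pre_activation <= 0:
            continue  # ReLU is inactive, so its gradient is zero
        for j, a in enumerate(row):
            grad[j] += wk * a
    return grad


def aggregate_saliency(image_len: int,
                       samples: List[Tuple[int, Sequence[float]]],
                       A: List[List[float]], w: Sequence[float],
                       b: float) -> List[float]:
    """Pixelwise maximum gradient magnitude over positively scored samples.

    Each sample is an (offset, patch) pair; its gradient is zero-padded
    into image coordinates before aggregation.
    """
    saliency = [0.0] * image_len
    for offset, z in samples:
        phi = [max(0.0, sum(a * zj for a, zj in zip(row, z))) for row in A]
        score = sum(wk * pk for wk, pk in zip(w, phi)) + b
        if score <= 0:
            continue  # background samples are ignored
        g = target_specific_gradient(z, A, w)
        for j, gj in enumerate(g):
            saliency[offset + j] = max(saliency[offset + j], abs(gj))
    return saliency


A = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]  # toy "network" with two features
w, b = [1.0, -0.5], -0.5                 # toy SVM parameters
image = [0.0, 3.0, 1.0, 0.0, 0.0]
samples = [(0, image[0:3]), (1, image[1:4]), (2, image[2:5])]
M = aggregate_saliency(len(image), samples, A, w, b)
```

In this toy run the first patch receives a negative score and is skipped, so pixel 0 stays at zero while the pixels covered by the two positive patches receive non-zero saliency.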

3.3. Target Localization with Saliency Map 

Given the target-specific saliency map at frame $t$ denoted by $M_t$, the
next step of our algorithm is to locate the target through sequential
Bayesian filtering. Let $\mathbf{x}_t$ and $M_t$ denote the state and
observation variables at the current frame $t$, respectively, where the
saliency map is used as the measurement. The posterior of the target state
$p(\mathbf{x}_t \mid M_{1:t})$ is given by

$$p(\mathbf{x}_t \mid M_{1:t}) \propto p(M_t \mid \mathbf{x}_t)\, p(\mathbf{x}_t \mid M_{1:t-1}), \tag{8}$$

where $p(\mathbf{x}_t \mid M_{1:t-1})$ denotes the prior distribution
predicted from the previous time step, and $p(M_t \mid \mathbf{x}_t)$ is the
observation likelihood.

The prior distribution $p(\mathbf{x}_t \mid M_{1:t-1})$ of the target state
at the current time step is estimated from the posterior at the previous
frame through prediction, which is given by

$$p(\mathbf{x}_t \mid M_{1:t-1}) = \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, p(\mathbf{x}_{t-1} \mid M_{1:t-1})\, d\mathbf{x}_{t-1}, \tag{9}$$

where $p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ denotes a state transition
model. Target dynamics between two consecutive frames are given by a simple
linear equation as

$$\mathbf{x}_t = \mathbf{x}_{t-1} + \mathbf{d}_t + \boldsymbol{\varepsilon}_t, \tag{10}$$

where $\mathbf{d}_t$ denotes the displacement of the target location, and
$\boldsymbol{\varepsilon}_t$ indicates Gaussian noise. Both $\mathbf{d}_t$
and $\boldsymbol{\varepsilon}_t$ are unknown before tracking in general, but
in our case they are estimated from the samples classified as target by our
online SVM. Specifically, $\mathbf{d}_t$ and $\boldsymbol{\varepsilon}_t$ are
given respectively by

$$\mathbf{d}_t = \boldsymbol{\mu}_t - \mathbf{x}^*_{t-1}, \qquad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0}, \Sigma_t), \tag{11}$$

where $\mathbf{x}^*_{t-1}$ denotes the target location at the previous frame,
and $\boldsymbol{\mu}_t$ and $\Sigma_t$ indicate the mean and variance of the
locations of positive samples at the current frame, respectively. From
Eq. (10) and (11), the transition model for prediction is derived as follows:

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t - \mathbf{x}_{t-1};\, \mathbf{d}_t, \Sigma_t). \tag{12}$$

Since the transition model is linear with Gaussian noise, computation of the
prior in Eq. (9) can be performed efficiently by translating the posterior
$p(\mathbf{x}_{t-1} \mid M_{1:t-1})$ at the previous step by $\mathbf{d}_t$
and applying Gaussian smoothing with covariance $\Sigma_t$.
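The shift-and-smooth computation of the prior can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the posterior is stored as a small 2-D grid, that the displacement is rounded to whole pixels, and that the covariance is diagonal so a separable Gaussian blur can be used. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def shift(grid: Grid, dy: int, dx: int) -> Grid:
    """Translate a probability map by (dy, dx) pixels, filling with zeros."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = grid[y][x]
    return out


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1-D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(math.ceil(3 * sigma)))
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def blur_rows(grid: Grid, kernel: List[float]) -> Grid:
    """Smooth every row with the kernel, ignoring out-of-range pixels."""
    r = len(kernel) // 2
    w = len(grid[0])
    return [[sum(kernel[i + r] * row[x + i]
                 for i in range(-r, r + 1) if 0 <= x + i < w)
             for x in range(w)] for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def predict_prior(posterior: Grid, d: Tuple[int, int],
                  sigma_y: float, sigma_x: float) -> Grid:
    """Prior = previous posterior shifted by d and smoothed by a Gaussian."""
    moved = shift(posterior, d[0], d[1])
    blurred = blur_rows(moved, gaussian_kernel(sigma_x))
    blurred = transpose(blur_rows(transpose(blurred), gaussian_kernel(sigma_y)))
    total = sum(map(sum, blurred))
    if total <= 0:
        return blurred  # nothing left inside the grid; leave unnormalized
    return [[v / total for v in row] for row in blurred]


posterior = [[0.0] * 9 for _ in range(9)]
posterior[4][4] = 1.0  # previous posterior concentrated at one location
prior = predict_prior(posterior, d=(1, 2), sigma_y=1.0, sigma_x=1.0)
peak = max(((y, x) for y in range(9) for x in range(9)),
           key=lambda p: prior[p[0]][p[1]])
```

With a point-mass posterior, the predicted prior is a Gaussian bump centered at the displaced location, which is exactly what the transition model in Eq. (12) prescribes.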

The measurement density function $p(M_t \mid \mathbf{x}_t)$ represents the
likelihood in the state space, which is typically obtained by computing the
similarity between the appearance models of target and candidates. In our
case, we utilize $M_t$, the target-specific saliency map at frame $t$, as the
observation to compute the likelihood of each target state. Note that the
pixelwise intensity and its spatial configuration in the saliency map provide
useful information for target localization. At frame $t$, we construct the
target appearance model $H_t$ from the previous saliency maps in a generative
way. Let $M_k(\mathbf{x}^*_k)$ denote the target filter at frame $k$, which
is obtained by extracting the subregion of $M_k$ at the location
corresponding to the optimal target bounding box given by $\mathbf{x}^*_k$.
The appearance model $H_t$ is constructed by aggregating the recent target
filters as follows:

$$H_t = \frac{1}{m} \sum_{k=t-m}^{t-1} M_k(\mathbf{x}^*_k), \tag{13}$$

where $m$ is a constant for the number of target filters to be used for model
construction. The main idea behind Eq. (13) is that the local saliency map
near the optimal target location in a frame plays the role of a filter that
identifies the target within the saliency maps of subsequent frames. Since
the appearance model is computed from the $m$ most recent target filters, we
need to store these $m$ filters to update it. Given the appearance model
defined in Eq. (13), the observation likelihood $p(M_t \mid \mathbf{x}_t)$ is
computed by simple convolution between $H_t$ and $M_t$:

$$p(M_t \mid \mathbf{x}_t) \propto H_t \otimes M_t(\mathbf{x}_t), \tag{14}$$

where $\otimes$ denotes the convolution operator. This is similar to the
procedure in object detection, e.g., (Felzenszwalb et al., 2010), where the
filter is constructed from features to represent the object category and
applied to the feature map to localize the object by convolution.
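A small sketch of Eq. (13) and (14) is given below. It makes two assumptions that are not stated in the text: the stored target filters are taken to have a common size (for example after resizing), and the "convolution" is implemented as a sliding inner product without flipping the filter, as is usual for template-style matching. The function names and the toy numbers are illustrative only.

```python
from typing import List

Grid = List[List[float]]


def average_filters(filters: List[Grid]) -> Grid:
    """H_t: elementwise mean of the m most recent target filters."""
    m = len(filters)
    h, w = len(filters[0]), len(filters[0][0])
    return [[sum(f[y][x] for f in filters) / m for x in range(w)]
            for y in range(h)]


def dense_likelihood(saliency: Grid, model: Grid) -> Grid:
    """Slide the model over the saliency map (valid positions only).

    Entry (y, x) is the inner product between the model and the saliency
    window whose top-left corner is at (y, x).
    """
    H, W = len(saliency), len(saliency[0])
    h, w = len(model), len(model[0])
    return [[sum(model[i][j] * saliency[y + i][x + j]
                 for i in range(h) for j in range(w))
             for x in range(W - w + 1)]
            for y in range(H - h + 1)]


# Two stored 2x2 target filters (made-up saliency values).
filters = [[[0.0, 1.0], [1.0, 2.0]],
           [[0.0, 1.0], [1.0, 4.0]]]
H_t = average_filters(filters)

# Current saliency map with a similar pattern whose top-left corner is (1, 2).
M_t = [[0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 1.0],
       [0.0, 0.0, 1.0, 3.0],
       [0.0, 0.0, 0.0, 0.0]]
likelihood = dense_likelihood(M_t, H_t)
best = max(((y, x) for y in range(len(likelihood))
            for x in range(len(likelihood[0]))),
           key=lambda p: likelihood[p[0]][p[1]])
```

The response peaks where the saliency pattern matches the averaged filter, which is why the spatial layout stored in the model sharpens localization compared with picking the top-scoring sample alone.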


Given the prior in Eq. (9) and the likelihood in Eq. (14), the target
posterior at the current frame is computed simply by applying Eq. (8). Once
the target posterior is obtained, the optimal target state is given by
solving the maximum a posteriori problem as

$$\mathbf{x}^*_t = \arg\max_{\mathbf{x}} p(\mathbf{x}_t \mid M_{1:t}). \tag{15}$$

Once tracking at frame $t$ is completed, we update the classifier based on
$\mathbf{x}^*_t$, which is discussed next.
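On a discretized state space, Eq. (8) and (15) reduce to an elementwise product followed by an argmax. The sketch below assumes the prior and likelihood are already available as grids of equal size; `map_estimate` is a hypothetical helper name and the numbers are made up.

```python
from typing import List, Tuple

Grid = List[List[float]]


def map_estimate(likelihood: Grid, prior: Grid) -> Tuple[Tuple[int, int], Grid]:
    """Return the (row, col) maximizing likelihood * prior and the
    normalized posterior map."""
    post = [[l * p for l, p in zip(l_row, p_row)]
            for l_row, p_row in zip(likelihood, prior)]
    total = sum(map(sum, post))
    if total > 0:
        post = [[v / total for v in row] for row in post]
    best = max(((y, x) for y in range(len(post)) for x in range(len(post[0]))),
               key=lambda p: post[p[0]][p[1]])
    return best, post


likelihood = [[1.0, 2.0], [4.0, 1.0]]
prior = [[0.1, 0.6], [0.2, 0.1]]
x_star, posterior = map_estimate(likelihood, prior)
```

In this example the likelihood alone peaks at (1, 0), but the motion prior moves the maximum a posteriori estimate to (0, 1), showing how the two terms interact.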


3.4. Discriminative Model Update by Online SVM 

We employ an online SVM to learn a discriminative model of the target. Our
SVM can be regarded as a fully-connected layer with a single node, but it
provides a fast and exact solution in a single pass to learn a model
incrementally.

Given a set of samples with associated labels,
$\{(\mathbf{x}'_i, y'_i)\}$, obtained from the current tracking results, we
update the weight vector $\mathbf{w}$ of the SVM. The label $y'_i$ of a new
example $\mathbf{x}'_i$ is given by

$$y'_i = \begin{cases} +1, & \text{if } \mathbf{x}'_i = \mathbf{x}^*_t \\ -1, & \text{if } \dfrac{|BB(\mathbf{x}^*_t) \cap BB(\mathbf{x}'_i)|}{|BB(\mathbf{x}^*_t) \cup BB(\mathbf{x}'_i)|} < S, \end{cases} \tag{16}$$

where $BB(\mathbf{x})$ denotes the bounding box corresponding to the given
state $\mathbf{x}$ and $S$ denotes a pre-defined threshold. Note that the
examples with bounding box overlap ratios larger than $S$ are not included in
the training set for our online learning, to avoid the drift problem.
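The labeling rule in Eq. (16) can be illustrated with a short sketch. Boxes are assumed to be axis-aligned `(x, y, width, height)` tuples, which is a representation chosen here for convenience; the default threshold of 0.3 is the value reported in Section 4.1, and the function names are hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def assign_label(sample: Box, target: Box, threshold: float = 0.3) -> Optional[int]:
    """+1 for the tracked box itself, -1 for low-overlap boxes, and None
    for boxes that overlap too much to be trusted as negatives."""
    if sample == target:
        return 1
    if overlap_ratio(sample, target) < threshold:
        return -1
    return None  # excluded from the training set to limit drift


target = (10.0, 10.0, 20.0, 20.0)
labels = [assign_label(box, target) for box in [
    (10.0, 10.0, 20.0, 20.0),  # the tracking result itself
    (12.0, 10.0, 20.0, 20.0),  # heavy overlap: ambiguous
    (25.0, 10.0, 20.0, 20.0),  # slight overlap: negative
    (40.0, 40.0, 20.0, 20.0),  # no overlap: negative
]]
```

Returning `None` for the ambiguous band keeps partially overlapping boxes out of both classes, which matches the note that such examples are left out of the training set.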

Before discussing the online SVM, we briefly review the optimization
procedure of an offline learning algorithm. Given training examples
$\{(\mathbf{x}_i, y_i)\}$, the offline SVM learns a weight vector
$\mathbf{w} = (w_1, \dots, w_n)^\top$ by solving a quadratic convex
optimization problem. The dual form of the SVM objective function is given by

$$\min_{0 \le \alpha_i \le C} : W = \frac{1}{2} \sum_{i,j} \alpha_i Q_{ij} \alpha_j - \sum_i \alpha_i + b \sum_i y_i \alpha_i, \tag{17}$$

where $\{\alpha_i\}$ are Lagrange multipliers, $b$ is the bias, and
$Q_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$. In our tracking algorithm,
the kernel function is defined by the inner product between two CNN features,
i.e., $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$.
In online tracking, it is not straightforward for conventional QP solvers to
handle the optimization problem in Eq. (17), as training data are given
sequentially, not all at once. Incremental SVM (Diehl & Cauwenberghs, 2003;
Cauwenberghs & Poggio, 2000) is an algorithm designed to learn SVMs in such
cases. The key idea of the algorithm is to retain the Karush-Kuhn-Tucker
(KKT) conditions on all the existing examples while updating the model with a
new example, so that it guarantees an exact solution at each increment of the
dataset. Specifically, the KKT conditions are the first-order necessary
conditions for the optimal solution of Eq. (17), which are given by

$$\frac{\partial W}{\partial \alpha_i} \begin{cases} \ge 0, & \text{if } \alpha_i = 0 \\ = 0, & \text{if } 0 < \alpha_i < C \\ \le 0, & \text{if } \alpha_i = C, \end{cases} \tag{18}$$

$$\frac{\partial W}{\partial b} = \sum_j y_j \alpha_j = 0, \tag{19}$$

where $\partial W / \partial \alpha_i$ is related to the margin of the $i$-th
example, which is denoted by $m_i$ afterwards. By the conditions in Eq. (18),
each training example belongs to one of the following three categories: $E_1$
for support vectors lying on the margin ($m_i = 0$), $E_2$ for support
vectors inside the margin ($m_i < 0$), and $E_3$ for non-support vectors.

Given the $k$-th example, incremental SVM estimates its Lagrange multiplier
$\alpha_k$ while retaining the KKT conditions on all the existing $k - 1$
training examples. In a nutshell, $\alpha_k$ is initialized to 0 and updated
by increasing its value over iterations. In each iteration, the algorithm
estimates the largest possible increment $\Delta\alpha_k$ that guarantees the
KKT conditions on the existing examples, and updates $\alpha_k$ and the
existing model parameters with $\Delta\alpha_k$. This iterative procedure
stops when the $k$-th example becomes a support vector or at least one
existing example changes its membership across $E_1$, $E_2$, and $E_3$. We
can easily generalize this online update procedure to the case where multiple
examples are provided as new training data. With the new and updated Lagrange
multipliers, the weight vector $\mathbf{w}$ is given by

$$\mathbf{w} = \sum_{i \in E_1 \cup E_2} \alpha_i y_i \phi(\mathbf{x}_i). \tag{20}$$

For efficiency, we maintain only a fixed number of support vectors with the
smallest margins during tracking. We refer the reader to (Diehl &
Cauwenberghs, 2003; Cauwenberghs & Poggio, 2000) for more details. Also, note
that any other method for online SVM learning, such as LaSVM (Bordes et al.,
2005) and LaRank (Bordes et al., 2007), can also be adopted in our framework.
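The bookkeeping around Eq. (18) and (20) can be sketched as follows. This is not the incremental update itself, only the final step of assembling $\mathbf{w}$ from given multipliers and sorting examples into the three categories by the value of $\alpha_i$; the tolerance, the function names, and the toy numbers are assumptions made for illustration.

```python
from typing import List, Sequence


def kkt_category(alpha: float, C: float, tol: float = 1e-12) -> str:
    """Category implied by the multiplier: E3 (alpha = 0, non-support),
    E2 (alpha = C, inside the margin), E1 (0 < alpha < C, on the margin)."""
    if alpha <= tol:
        return "E3"
    if alpha >= C - tol:
        return "E2"
    return "E1"


def weight_vector(alphas: Sequence[float], labels: Sequence[int],
                  features: List[Sequence[float]]) -> List[float]:
    """w = sum_i alpha_i * y_i * phi(x_i); examples with alpha_i = 0
    (category E3) contribute nothing."""
    dim = len(features[0])
    w = [0.0] * dim
    for a, y, f in zip(alphas, labels, features):
        if a == 0:
            continue
        for k in range(dim):
            w[k] += a * y * f[k]
    return w


C = 1.0
alphas = [0.5, 0.0, 1.0]
labels = [1, -1, -1]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy stand-ins for CNN features
categories = [kkt_category(a, C) for a in alphas]
w = weight_vector(alphas, labels, features)
```

Because only examples in $E_1 \cup E_2$ carry non-zero multipliers, discarding the rest (or keeping only a fixed number of support vectors, as described above) bounds the cost of recomputing the weights.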


4. Experiments 

This section describes our implementation details and experimental setting.
The effectiveness of our tracking algorithm is then demonstrated by
quantitative and qualitative analysis on a large number of benchmark
sequences.


4.1. Implementation Details 


For feature extraction, we adopt the R-CNN model built upon the Caffe library
(Jia, 2013). The CNN takes an image from a sample bounding box, which is
resized to 227 x 227, and outputs a 4096-dimensional vector from its first
fully-connected (fc6) layer as the feature vector corresponding to the
sample. To generate target candidates in each frame, we draw $N$ (= 120)
samples from a normal distribution as
$\mathbf{x}_i \sim \mathcal{N}(\mathbf{x}^*_{t-1}, \sqrt{wh}/2)$, where $w$
and $h$ denote the width and height of the target, respectively. The SVM
classifier and the generative model are updated only if at least one example
is classified as positive by the SVM. When generating training examples for
our SVM, the threshold $S$ in Eq. (16) is set to 0.3. The number of
observations $m$ used to build the generative model in Eq. (13) is set to 30.
To obtain a segmentation mask, we employ GrabCut (Rother et al., 2004), where
pixels that have a saliency value larger than 70% of the maximum saliency are
used as foreground seeds, and background pixels around the target bounding
box within a margin of up to 50 pixels are used as background seeds. All
parameters are fixed for all sequences throughout our experiments.
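The candidate sampling step can be sketched as below. The text does not say whether $\sqrt{wh}/2$ is a standard deviation or a variance, nor which state dimensions are perturbed; this sketch assumes it is the standard deviation applied independently to the two center coordinates, and the function name, the seed parameter and the example numbers are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


def draw_candidates(prev_center: Tuple[float, float], width: float,
                    height: float, n: int = 120,
                    seed: Optional[int] = None) -> List[Tuple[float, float]]:
    """Draw n candidate centers around the previous target location.

    Assumption: sqrt(width * height) / 2 is used as the standard deviation
    of an isotropic Gaussian over the (x, y) center.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(width * height) / 2.0
    return [(rng.gauss(prev_center[0], sigma), rng.gauss(prev_center[1], sigma))
            for _ in range(n)]


candidates = draw_candidates((100.0, 80.0), width=40.0, height=60.0, seed=0)
mean_x = sum(c[0] for c in candidates) / len(candidates)
```

Tying the spread to the target size means that larger targets are searched over a proportionally wider neighborhood without a per-sequence parameter.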


4.2. Analysis of Generative Appearance Models 

The generative model $H_t$ is used to localize the target using the
target-specific saliency map. As described earlier, the target-specific
saliency map shows high responses around discriminative target regions; our
generative model exploits this property and is constructed using the saliency
maps in the previous frames. Figure 3 illustrates examples of the learned
generative models in several sequences. Generally, the model successfully
captures parts and shape of an object, which are useful to discriminate the
target from the background. More importantly, the distribution of responses
within the model reveals the spatial configuration of the target, which
provides a strong cue for precise localization. This can be clearly observed
in the examples of face and doll, where the scores from the areas of eyes and
nose can be used to localize the target. When the target is not rigid (e.g.,
a person), we observe that the model has stronger responses on less
deformable parts of the target (e.g., the head), and localization
consequently relies more on the stable parts.

Figure 3. Examples of generative models learned by our algorithm.
In each example, the left and right images indicate the target
and the learned model, respectively.


4.3. Evaluation 

Dataset and compared algorithms. To evaluate the performance, we employ all
50 sequences from the recently released tracking benchmark dataset (Wu et
al., 2013). The sequences in the dataset involve various tracking challenges
such as illumination variation, deformation, motion blur, background clutter,
etc. We compared our method with the top 10 trackers in (Wu et al., 2013),
which include SCM (Zhong et al., 2012), Struck (Hare et al., 2011), TLD
(Kalal et al., 2012), ASLA (Jia et al., 2012), CXT (Dinh et al., 2011), VTS
(Kwon & Lee, 2011), VTD (Kwon & Lee, 2010), CSK (Henriques et al., 2012), LSK
(Liu et al., 2011) and DFT (Sevilla-Lara & Learned-Miller, 2012). We used the
results reported in (Wu et al., 2013) for these tracking algorithms.


Evaluation methodology. We follow the evaluation protocols in (Wu et al.,
2013), where the performance of trackers is measured based on two different
metrics: success rate and precision plots. In both metrics, the ratio of
successfully tracked frames is measured over a set of thresholds, where the
bounding box overlap ratio and the center location error are employed in the
success rate plot and the precision plot, respectively. We rank the tracking
algorithms based on the Area Under Curve (AUC) for the success rate plot and
on the precision at a center location error threshold of 20 pixels for the
precision plot.
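The two ranking scores can be sketched as follows, given per-frame overlap ratios and center location errors. The number and spacing of overlap thresholds and the strictness of the inequalities are assumptions made here; the benchmark toolkit may differ in these details, and the function names and sample values are illustrative.

```python
from typing import Sequence


def success_auc(overlaps: Sequence[float], num_thresholds: int = 21) -> float:
    """Approximate area under the success curve by averaging the success
    rate over evenly spaced overlap thresholds in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def precision_at(errors: Sequence[float], pixel_threshold: float = 20.0) -> float:
    """Fraction of frames whose center location error is within the threshold."""
    return sum(e <= pixel_threshold for e in errors) / len(errors)


# Made-up per-frame measurements for a four-frame sequence.
overlaps = [0.8, 0.6, 0.0, 0.4]
errors = [5.0, 12.0, 80.0, 20.0]
auc = success_auc(overlaps)
precision = precision_at(errors)
```

Summarizing the success plot by its area avoids committing to a single overlap threshold, while the precision score reports one operating point at 20 pixels.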


Quantitative results in bounding box. We evaluate our method quantitatively
and make a comparative study with other methods on all 50 benchmark
sequences; the results are summarized in Figure 4 for both success rate and
precision plots. In both measures, our method outperforms all other trackers
by substantial margins. This is probably because the CNN features are more
effective at representing the high-level concept of the target than
hand-crafted ones, although the network is trained offline for another
purpose. We also compare our full algorithm with its reduced version, denoted
by Ours_SVM, which depends only on SVM scores as conventional
tracking-by-detection algorithms do. Our full algorithm achieves a
non-trivial performance improvement over the reduced version, which shows
that our generative model based on the target-specific saliency map is useful
for localizing the target in general.

Figure 4. Average success plot (top) and precision plot (bottom)
over 50 benchmark sequences. Numbers in the legend indicate the
overall score of each tracker, calculated by area under curve for the
success plot and by precision at 20 pixels for the precision plot.

To gain more insight about the proposed algorithm, we evaluate the
performance of trackers based on individual attributes provided in the
benchmark dataset. Note that the attributes describe 11 different types of
tracking challenges and are annotated for each sequence. Tables 1 and 2
summarize the results in two different measures. The numbers next to the
attributes indicate the number of sequences involving the corresponding
attribute. As illustrated in the tables, our algorithm consistently
outperforms other methods in almost all challenges, and our full algorithm is
generally
Table 1. Average success rate scores on individual attributes. Red: best, blue: second best.

Attribute                    DFT    LSK    CSK    VTS    VTD    CXT    ASLA   TLD    Struck  SCM    Ours_SVM  Ours
Illumination variation (25)  0.383  0.371  0.369  0.429  0.420  0.368  0.429  0.399  0.428   0.473  0.522     0.556
Out-of-plane rotation (39)   0.387  0.400  0.386  0.425  0.434  0.418  0.422  0.420  0.432   0.470  0.524     0.582
Scale variation (28)         0.329  0.373  0.350  0.400  0.405  0.389  0.452  0.421  0.425   0.518  0.456     0.513
Occlusion (29)               0.381  0.409  0.365  0.398  0.403  0.372  0.376  0.402  0.413   0.487  0.539     0.563
Deformation (19)             0.439  0.377  0.343  0.368  0.377  0.324  0.372  0.378  0.393   0.448  0.623     0.640
Motion blur (12)             0.333  0.302  0.305  0.304  0.309  0.369  0.258  0.404  0.433   0.298  0.572     0.565
Fast motion (17)             0.320  0.328  0.316  0.300  0.302  0.388  0.247  0.417  0.462   0.296  0.545     0.545
In-plane rotation (31)       0.365  0.411  0.399  0.416  0.430  0.452  0.425  0.416  0.444   0.458  0.501     0.571
Out of view (6)              0.351  0.430  0.349  0.443  0.446  0.427  0.312  0.457  0.459   0.361  0.592     0.571
Background clutter (21)      0.407  0.388  0.421  0.428  0.425  0.338  0.408  0.345  0.458   0.450  0.519     0.593
Low resolution (4)           0.200  0.235  0.350  0.168  0.177  0.312  0.157  0.309  0.372   0.279  0.438     0.461
Weighted average             0.389  0.395  0.398  0.416  0.416  0.426  0.434  0.437  0.474   0.499  0.554     0.597


Table 2. Average precision scores on individual attributes. Red: best, blue: second best.

Attribute                    DFT    LSK    CSK    VTS    VTD    CXT    ASLA   TLD    Struck  SCM    Ours_SVM  Ours
Illumination variation (25)  0.475  0.449  0.481  0.573  0.557  0.501  0.517  0.537  0.558   0.594  0.725     0.780
Out-of-plane rotation (39)   0.497  0.525  0.540  0.604  0.620  0.574  0.518  0.596  0.597   0.618  0.745     0.832
Scale variation (28)         0.441  0.480  0.503  0.582  0.597  0.550  0.552  0.606  0.639   0.672  0.679     0.827
Occlusion (29)               0.481  0.534  0.500  0.534  0.545  0.491  0.460  0.563  0.564   0.640  0.734     0.770
Deformation (19)             0.537  0.481  0.476  0.487  0.501  0.422  0.445  0.512  0.521   0.586  0.870     0.858
Motion blur (12)             0.383  0.324  0.342  0.375  0.375  0.509  0.278  0.518  0.551   0.339  0.764     0.745
Fast motion (17)             0.373  0.375  0.381  0.353  0.352  0.515  0.253  0.551  0.604   0.333  0.735     0.723
In-plane rotation (31)       0.469  0.534  0.547  0.579  0.599  0.610  0.511  0.584  0.617   0.597  0.720     0.836
Out of view (6)              0.391  0.515  0.379  0.455  0.462  0.510  0.333  0.576  0.539   0.429  0.744     0.687
Background clutter (21)      0.507  0.504  0.585  0.578  0.571  0.443  0.496  0.428  0.585   0.578  0.716     0.789
Low resolution (4)           0.211  0.304  0.411  0.187  0.168  0.371  0.156  0.349  0.545   0.305  0.536     0.705
Weighted average             0.496  0.505  0.545  0.575  0.576  0.575  0.532  0.608  0.656   0.649  0.780     0.852



Figure 5. Qualitative results for selected sequences: (from left to right) MotorRolling, FaceOcc1, Lemming, Jogging, Tiger, Basketball
and David3. (Row 1) Comparisons to other trackers. (Row 2) Target-specific saliency maps. (Row 3) Segmentation by GrabCut with
target-specific saliency maps.


better than its reduced version. 

Quantitative results in segmentation The proposed algorithm
produces pixel-wise target segmentation using the
target-specific discriminative saliency map. To evaluate
segmentation accuracy, we select 9 video sequences from the
online tracking benchmark dataset¹ and annotate ground-truth
segmentation for each sequence. The selected sequences cover
various attributes in tracking challenges, and the list of
sequences with their associated attributes is summarized in
Table 3.

¹ Since accurate annotation of segmentation is labor intensive
and time consuming, we selected a subset of sequences (typically
short ones) for evaluation.

Table 3. List of sequences and their attributes used for
segmentation performance evaluation. The set of sequences
contains 10 attributes (out of 11 altogether): illumination
variation (IV), out-of-plane rotation (OPR), scale variation
(SV), occlusion (OCC), deformation (DEF), motion blur (MB),
fast motion (FM), in-plane rotation (IPR), background clutter
(BC) and low resolution (LR). The numbers in parentheses
denote the number of frames.

Sequence name       Attributes
Bolt (350)          OPR, OCC, DEF, IPR
Coke (291)          IV, OPR, OCC, FM, IPR
Couple (140)        OPR, SV, DEF, FM, BC
Jogging (307)       OPR, OCC, DEF
MotorRolling (164)  IV, SV, MB, FM, IPR, BC, LR
MountainBike (228)  OPR, IPR, BC
Walking (412)       SV, OCC, DEF
Walking2 (500)      SV, OCC, LR
Woman (597)         IV, OPR, SV, OCC, DEF, MB, FM

Figure 6. Average success plot over 9 selected sequences.
Numbers in the legend indicate overall scores calculated by AUC.

The segmentation performance of the proposed algorithm is
evaluated based on the overlap ratio (intersection over union)
between the ground-truth and the identified target
segmentation. As the other trackers used for comparison may
not be able to generate pixel-wise segmentation, we employ
their bounding box outputs as segmentation masks and compute
the overlap ratio with respect to the ground-truth
segmentation. The results are presented as a success plot in
Figure 6, where Ours_seg denotes the proposed algorithm with
target segmentation. According to Figure 6, our method
outperforms all other trackers by a substantial margin. In
particular, we observe a large performance improvement of the
proposed target segmentation algorithm over our bounding box
trackers, denoted by Ours and Ours_SVM. This suggests that the
proposed target-specific saliency map is sufficiently accurate
to estimate the target area in a video and thus can be utilized
to further improve tracking.
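
The segmentation overlap ratio used in this comparison can be sketched as below. This is only an illustration: the mask representation (lists of rows of 0/1), the (x, y, w, h) box format, and the helper names are assumptions made for this example, not part of the evaluation code used in the experiments.

```python
def mask_iou(a, b):
    """Intersection over union of two equally sized binary masks.

    Each mask is a list of rows, and each row is a list of 0/1 values.
    """
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0

def box_to_mask(box, height, width):
    """Rasterize an (x, y, w, h) bounding box into a binary mask.

    This mirrors how a bounding box output can be treated as a
    segmentation mask for trackers without pixel-wise output.
    """
    x, y, w, h = box
    return [[1 if x <= c < x + w and y <= r < y + h else 0
             for c in range(width)]
            for r in range(height)]

# Hypothetical 4x4 ground-truth segmentation with an L-shaped target.
gt_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
# A bounding box tracker covers the enclosing 2x2 square instead.
bbox_mask = box_to_mask((1, 1, 2, 2), height=4, width=4)
print(mask_iou(bbox_mask, gt_mask))
```

A box that tightly encloses a non-rectangular target still includes background pixels, so its overlap with the ground-truth segmentation stays below 1 even when localization is correct.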

Qualitative Results We present the results on several
sequences in Figure 5, where original frames with tracking
results, target-specific saliency maps, and segmentation
results are illustrated. We can observe that our algorithm
also demonstrates superior performance to other algorithms
qualitatively.

5. Conclusion 

We proposed a novel visual tracking algorithm based on a
pre-trained CNN, where outputs from the last convolutional
layer of the CNN are employed as generic feature descriptors
of objects, and discriminative appearance models are learned
online using an online SVM. With the CNN features and the
learned discriminative model, we compute the target-specific
saliency map by back-propagation, which highlights the
discriminative target regions in the spatial domain. Tracking
is performed by sequential Bayesian filtering with the
target-specific saliency map as observation. The proposed
algorithm achieves a substantial performance gain over the
existing state-of-the-art trackers and shows the capability
for target segmentation.
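
As a toy illustration of the back-propagation step summarized above, the following sketch computes the gradient of a linear SVM score with respect to the input of a small network. Everything here is an assumption made for the example: the network is a hypothetical single ReLU layer rather than the pre-trained CNN, and the weights, names, and input size are invented.

```python
def relu_forward(W, x):
    """Hidden features h = max(0, W x) for a weight matrix W (list of rows)."""
    return [max(0.0, sum(wij * xj for wij, xj in zip(row, x))) for row in W]

def svm_score(w, b, h):
    """Linear SVM score w . h + b on the feature vector h."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def saliency(W, w, x):
    """Gradient of the SVM score with respect to the input x."""
    h = relu_forward(W, x)
    # Backward through the linear SVM: d(score)/dh = w.
    # Backward through ReLU: the gradient passes only through active units.
    grad_h = [wi if hi > 0 else 0.0 for wi, hi in zip(w, h)]
    # Backward through the linear layer: d(score)/dx = W^T grad_h.
    return [sum(W[i][j] * grad_h[i] for i in range(len(W)))
            for j in range(len(x))]

# Hypothetical 3-pixel input, 2 hidden features, and SVM weights.
W = [[1.0, 0.0, -1.0],
     [0.0, 2.0, 0.0]]
w = [0.5, 1.5]
x = [1.0, 1.0, 3.0]
print(svm_score(w, 0.0, relu_forward(W, x)), saliency(W, w, x))
```

Input positions with a large gradient are those whose change most affects the SVM score, which is the sense in which such a map highlights discriminative regions.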